-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Concatenation of DFs with all NaT columns and TZ-aware ones breaks #12396 #12403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -4802,6 +4802,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na): | |||
|
|||
if self.is_null and not getattr(self.block, 'is_categorical', | |||
None): | |||
if isinstance(empty_dtype, DatetimeTZDtype): | |||
# handle timezone-aware all NaT cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see if u can make this if else flow s bit more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what i mean is make this an else-if so this cleanly handles the different dtypes (categorical, datetime with tz) and is obvious what is going on
Could do
|
@@ -4802,6 +4802,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na): | |||
|
|||
if self.is_null and not getattr(self.block, 'is_categorical', | |||
None): | |||
if isinstance(empty_dtype, DatetimeTZDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also blow away the numpy 1.6 workaround comment / code.
@hcontrast see what you can do here. This involves cleaning up some code and making it more generic to handle the extension dtypes. lmk if you get stuck. |
Ok. Will have to wait until the weekend though |
I modified some things in #12702 pls update/rebase when that's merged |
Rebased on top of master. Having trouble running tests locally though |
dtype = DatetimeTZDtype('ns', tz='UTC') | ||
first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype) | ||
|
||
result = pd.concat([first, second], 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always write out parameters axis=0
if you are passing them.
sorry, meant here. can you rebase/update? |
Updated the tests, but the mixed tz-/non-tz-aware cases are now breaking again: ======================================================================
ERROR: test_concat_NaT_dataframes_all_NaT_axis_0 (pandas.tools.tests.test_merge.TestConcatenate)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/.../dev/pandas/pandas/tools/tests/test_merge.py", line 2606, in test_concat_NaT_dataframes_all_NaT_axis_0
result = pd.concat([first, second], axis=0)
File "/.../dev/pandas/pandas/tools/merge.py", line 836, in concat
return op.get_result()
File "/.../dev/pandas/pandas/tools/merge.py", line 1024, in get_result
concat_axis=self.axis, copy=self.copy)
File "/.../dev/pandas/pandas/core/internals.py", line 4544, in concatenate_block_managers
for placement, join_units in concat_plan]
File "/.../dev/pandas/pandas/core/internals.py", line 4641, in concatenate_join_units
for ju in join_units]
File "/.../dev/pandas/pandas/core/internals.py", line 4907, in get_reindexed_values
missing_arr = np.empty(self.shape, dtype=empty_dtype)
TypeError: data type not understood where # pandas/tools/tests/test_merge.py:2601
# one side timezone-aware
dtype = DatetimeTZDtype('ns', tz='UTC')
first = pd.DataFrame([[pd.NaT], [pd.NaT]], dtype=dtype)
result = pd.concat([first, second], axis=0) Looks like accidental/eager upcasting on closer inspection
The second block is regular datetime but gets passed a tz-aware empty value, hence triggering the wrong case. |
ok I just pushed a change where we moved all of the concat stuff together (now in use this to handle the path
|
Thanks. I've applied that fix, only outstanding issue is the mixed Series-DF case, which doesn't preserve the dtype information correctly:
def test_concat_NaT_series_dataframe_all_NaT(self):
# GH 12396
# non-timezone aware
first = pd.Series([pd.NaT, pd.NaT])
second = pd.DataFrame([[pd.Timestamp('2015/01/01')],
[pd.Timestamp('2016/01/01')]])
expect = pd.DataFrame([pd.NaT, pd.NaT, second.iloc[0, 0],
second.iloc[1, 0]], index=[0, 1, 0, 1])
result = pd.concat([first, second])
assert_frame_equal(result, expect)
# one side timezone-aware
dtype = DatetimeTZDtype('ns', tz='UTC')
first = first.astype(dtype)
result = pd.concat([first, second])
assert_frame_equal(result, expect, check_dtype=True)
self.assertEqual(result.dtypes.iloc[0], second.dtypes[0])
# both sides timezone-aware
second = second.apply(lambda x: x.astype(dtype))
result = pd.concat([first, second])
assert_frame_equal(result, expect, check_dtype=True)
self.assertEqual(result.dtypes.iloc[0], second.dtypes[0]) |
that last already seems to work |
Might be due to the patch then |
@@ -4911,6 +4911,11 @@ def get_reindexed_values(self, empty_dtype, upcasted_na): | |||
pass | |||
elif getattr(self.block, 'is_sparse', False): | |||
pass | |||
elif com.is_extension_type(empty_dtype): | |||
num_elements = np.prod(self.shape) | |||
# handle timezone-aware all NaT cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just do this if it's a datetime tz type
elif com.is_extension_type(empty_dtype) and \ | ||
com.is_datetimetz(empty_dtype): | ||
num_elements = np.prod(self.shape) | ||
# handle timezone-aware all NaT cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, do this under the if getattr(self.block, 'is_datetimetz', False) or com.is_datetimetz(empty_dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like so? This passes the tests but not sure how good the coverage is for the non-NaT tz-aware cases.
if getattr(self.block, 'is_datetimetz', False) \
or com.is_datetimetz(empty_dtype):
num_elements = np.prod(self.shape)
# handle timezone-aware all NaT cases
return DatetimeIndex([fill_value] * num_elements,
dtype=empty_dtype)
elif getattr(self.block, 'is_categorical', False):
pass
elif getattr(self.block, 'is_sparse', False):
pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes more like this. yes coverage of extension types is prob only so now
Current coverage is 84.14%@@ master #12403 diff @@
==========================================
Files 137 137
Lines 50287 50288 +1
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 42311 42313 +2
+ Misses 7976 7975 -1
Partials 0 0
|
ok (this is the original issue)
This is incorrect; we are coercing to object (correct), but the
|
@hcontrast can you update. only small fix I think. |
Added test case for this, indeed still breaking |
Can't actually get the test to break. This should break but doesn't first = pd.DataFrame([[pd.NaT], [pd.NaT]])
second = pd.DataFrame([[pd.Timestamp('2015/01/01', tz='UTC')],
[pd.Timestamp('2016/01/01', tz='US/Eastern')]])
expect = pd.DataFrame([pd.NaT, pd.NaT, second.iloc[0, 0],
second.iloc[1, 0]], index=[0, 1, 0, 1])
result = pd.concat([first, second], axis=0)
assert_frame_equal(result, expect)
print expect
print result gives
|
This is a tricky one; we have this flag to test exactness (so do this for now).
@sinhrks you think we should be checking for this by default? (I forgot why I added this, IIRC maybe for stata tests or something) |
Or we should change |
@hcontrast can you update |
can you rebase and update |
can you rebase & update |
can you rebase / update? |
closing, but pls reopen / comment if you can continue |
this is almost there ..... |
Want to knock this one out in conjunction with #18447. Just one question, what's the general desired behavior for missing data in concatenations? Should existing NaTs and nans always remain as they are, and otherwise use whatever the fill value is for the block for any new missing data created by the concatenation? |
nan is always by type the tests look pretty good here |
go ahead and open a new PR |
git diff upstream/master | flake8 --diff
concat
to handle tz-aware all NaT/mixed cases